57 research outputs found

Dynamic Graph Stream Algorithms in $o(n)$ Space

Authors: Huang Zengfeng, Peng Pan
Publication date: 01/01/2016

In this paper we study graph problems in the dynamic streaming model, where the input is defined by a sequence of edge insertions and deletions. As many natural problems require $\Omega(n)$ space, where $n$ is the number of vertices, existing works mainly focused on designing $\tilde{O}(n)$ space algorithms. Although sublinear in the number of edges for dense graphs, this could still be too large for many applications (e.g., $n$ is huge or the graph is sparse). In this work, we give single-pass algorithms beating this space barrier for two classes of problems. We present $o(n)$ space algorithms for estimating the number of connected components with additive error $\varepsilon n$ and $(1+\varepsilon)$-approximating the weight of the minimum spanning tree, for any small constant $\varepsilon>0$. The latter improves on the previous $\tilde{O}(n)$ space algorithm given by Ahn et al. (SODA 2012) for connected graphs with bounded edge weights. We initiate the study of approximate graph property testing in the dynamic streaming model, where we want to distinguish graphs satisfying the property from graphs that are $\varepsilon$-far from having the property. We consider the problem of testing $k$-edge connectivity, $k$-vertex connectivity, cycle-freeness and bipartiteness (of planar graphs), for which we provide algorithms using roughly $\tilde{O}(n^{1-\varepsilon})$ space, which is $o(n)$ for any constant $\varepsilon$. To complement our algorithms, we present $\Omega(n^{1-O(\varepsilon)})$ space lower bounds for these problems, which show that such a dependence on $\varepsilon$ is necessary.

Comment: ICALP 201
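To make the input model concrete, the following is a minimal Python sketch of a graph defined by a stream of edge insertions and deletions. It is an exact baseline that stores every surviving edge and then counts connected components with union-find, so it uses far more than the $o(n)$ space the abstract targets; it is not the paper's algorithm. The function names and the `("+", u, v)` / `("-", u, v)` update encoding are assumptions made for illustration.

```python
from collections import Counter


def final_edges(stream):
    """Replay insertions ('+') and deletions ('-'); return the surviving edges."""
    multiplicity = Counter()
    for op, u, v in stream:
        edge = (min(u, v), max(u, v))  # undirected: normalise endpoint order
        multiplicity[edge] += 1 if op == "+" else -1
    return [edge for edge, count in multiplicity.items() if count > 0]


def count_components(n, edges):
    """Exact number of connected components on vertices 0..n-1 (union-find)."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    components = n
    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v
            components -= 1
    return components


if __name__ == "__main__":
    stream = [("+", 0, 1), ("+", 1, 2), ("-", 0, 1), ("+", 3, 4)]
    print(count_components(5, final_edges(stream)))
```

An estimator with additive error $\varepsilon n$ would be allowed to return any value within $\varepsilon n$ of this exact count.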

arXiv.org e-Print Archive

Dagstuhl Research Online Publication Server

UNSWorks

Randomized Algorithms for Tracking Distributed Count, Frequencies, and Ranks

Authors: Huang Zengfeng, Yi Ke, Zhang Qin
Publication date: 02/12/2011

We show that randomization can lead to significant improvements for a few fundamental problems in distributed tracking. Our basis is the count-tracking problem, where there are $k$ players, each holding a counter $n_i$ that gets incremented over time, and the goal is to track an $\varepsilon$-approximation of their sum $n=\sum_i n_i$ continuously at all times, using minimum communication. While the deterministic communication complexity of the problem is $\Theta(k/\varepsilon \cdot \log N)$, where $N$ is the final value of $n$ when the tracking finishes, we show that with randomization, the communication cost can be reduced to $\Theta(\sqrt{k}/\varepsilon \cdot \log N)$. Our algorithm is simple and uses only $O(1)$ space at each player, while the lower bound holds even assuming each player has infinite computing power. Then, we extend our techniques to two related distributed tracking problems, frequency-tracking and rank-tracking, and obtain similar improvements over previous deterministic algorithms. Both problems are of central importance in large data monitoring and analysis, and have been extensively studied in the literature.

Comment: 19 pages, 1 figur
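For orientation, here is a minimal Python sketch of a standard deterministic threshold scheme for count-tracking. It is a baseline of the kind the deterministic bound refers to, not the randomized algorithm of the paper. Each player reports its counter whenever it has grown by a factor of $(1+\varepsilon)$ since the last report, so a player sends $O(\frac{1}{\varepsilon}\log N)$ messages and the coordinator's sum $S$ of last reports always satisfies $S \le n \le (1+\varepsilon)S$. The names `Site` and `track` are hypothetical.

```python
class Site:
    """One player: holds a counter and decides when to report it."""

    def __init__(self, eps):
        self.eps = eps
        self.count = 0
        self.last_sent = 0

    def increment(self):
        """Increment the counter; return the new value if a report is due, else None."""
        self.count += 1
        # Report on the first increment, then whenever the counter has grown
        # by a (1 + eps) factor since the last report.
        if self.count >= (1 + self.eps) * self.last_sent:
            self.last_sent = self.count
            return self.count
        return None


def track(increments, k, eps):
    """Replay increments (a sequence of site indices).

    After each increment, yield (coordinator estimate, true sum, messages so far).
    """
    sites = [Site(eps) for _ in range(k)]
    reported = [0] * k  # coordinator's view: last value reported by each site
    messages = 0
    for i in increments:
        value = sites[i].increment()
        if value is not None:
            reported[i] = value
            messages += 1
        yield sum(reported), sum(site.count for site in sites), messages


if __name__ == "__main__":
    import random

    rng = random.Random(0)
    stream = [rng.randrange(4) for _ in range(2000)]
    estimate, true_sum, messages = list(track(stream, k=4, eps=0.25))[-1]
    print(estimate, true_sum, messages)
```

The randomized protocol in the abstract reduces the dependence on $k$ from linear to $\sqrt{k}$; this sketch only illustrates the problem setup and the deterministic cost it improves on.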

arXiv.org e-Print Archive

Hong Kong University of Science and Technology Institutional Repository

Optimal Clustering with Noisy Queries via Multi-Armed Bandit

Authors: Huang Zengfeng, Xia Jinghui
Publication date: 12/07/2022

Motivated by many applications, we study clustering with a faulty oracle. In this problem, there are $n$ items belonging to $k$ unknown clusters, and the algorithm is allowed to ask the oracle whether two items belong to the same cluster or not. However, the answer from the oracle is correct only with probability $\frac{1}{2}+\frac{\delta}{2}$. The goal is to recover the hidden clusters with the minimum number of noisy queries. Previous works have shown that the problem can be solved with $O(\frac{nk\log n}{\delta^2} + \text{poly}(k,\frac{1}{\delta}, \log n))$ queries, while $\Omega(\frac{nk}{\delta^2})$ queries are known to be necessary. So, for any values of $k$ and $\delta$, there is still a non-trivial gap between upper and lower bounds. In this work, we obtain the first matching upper and lower bounds for a wide range of parameters. In particular, a new polynomial time algorithm with $O(\frac{n(k+\log n)}{\delta^2} + \text{poly}(k,\frac{1}{\delta}, \log n))$ queries is proposed. Moreover, we prove a new lower bound of $\Omega(\frac{n\log n}{\delta^2})$, which, combined with the existing $\Omega(\frac{nk}{\delta^2})$ bound, matches our upper bound up to an additive $\text{poly}(k,\frac{1}{\delta},\log n)$ term. To obtain the new results, our main ingredient is an interesting connection between our problem and multi-armed bandit, which might provide useful insights for other similar problems.

Comment: ICML 202
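The step from the two separate lower bounds to a single bound matching the leading term of the upper bound is short; a sketch of the arithmetic, using only the quantities stated in the abstract:

```latex
\[
\max\!\left(\frac{nk}{\delta^2},\, \frac{n\log n}{\delta^2}\right)
\;\ge\; \frac{1}{2}\left(\frac{nk}{\delta^2} + \frac{n\log n}{\delta^2}\right)
\;=\; \frac{n(k+\log n)}{2\,\delta^2},
\]
\[
\text{so the number of queries is } \Omega\!\left(\frac{n(k+\log n)}{\delta^2}\right).
\]
```

Since both lower bounds hold, so does their maximum, and the maximum of two non-negative quantities is at least half their sum.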

arXiv.org e-Print Archive

Dynamic Self-training Framework for Graph Convolutional Networks

Authors: Huang Zengfeng, Zhang Shenzhong, Zhou Ziang
Publication date: 07/10/2019

Graph neural networks (GNNs) such as GCN, GAT, and MoNet have achieved state-of-the-art results on semi-supervised learning on graphs. However, when the number of labeled nodes is very small, the performance of GNNs degrades dramatically. Self-training has proved to be effective for resolving this issue; however, the performance of self-trained GCN is still inferior to that of G2G and DGI in many settings. Moreover, additional model complexity makes it more difficult to tune the hyper-parameters and do model selection. We argue that the power of self-training is still not fully explored for the node classification task. In this paper, we propose a unified end-to-end self-training framework called Dynamic Self-training, which generalizes and simplifies prior work. A simple instantiation of the framework based on GCN is provided, and empirical results show that our framework outperforms all previous methods, including GNNs, embedding-based methods and self-trained GCNs, by a noticeable margin. Moreover, compared with standard self-training, hyper-parameter tuning for our framework is easier.

Comment: 11page

arXiv.org e-Print Archive

GB-KMV: An Augmented KMV Sketch for Approximate Containment Similarity Search

Authors: Huang Zengfeng, Yang Yang, Zhang Wenjie, Zhang Ying
Publication date: 03/09/2018

In this paper, we study the problem of approximate containment similarity search. Given two records $Q$ and $X$, the containment similarity between $Q$ and $X$ with respect to $Q$ is $|Q \cap X| / |Q|$. Given a query record $Q$ and a set of records $S$, the containment similarity search finds the records in $S$ whose containment similarity with respect to $Q$ is not less than the given threshold. This problem has many important applications in commercial and scientific fields such as record matching and domain search. The existing solution relies on the asymmetric LSH method by transforming the containment similarity to the well-studied Jaccard similarity. In this paper, we use a different framework by transforming the containment similarity to set intersection. We propose a novel augmented KMV sketch technique, namely GB-KMV, which is data-dependent and can achieve a good trade-off between the sketch size and the accuracy. We provide theoretical analysis to underpin the proposed augmented KMV sketch technique, and show that it outperforms the state-of-the-art technique LSH-E in terms of estimation accuracy under practical assumptions. Our comprehensive experiments on real-life datasets verify that GB-KMV is superior to LSH-E in terms of the space-accuracy trade-off, time-accuracy trade-off, and the sketch construction time. For instance, with similar estimation accuracy (F-1 score), GB-KMV is over 100 times faster than LSH-E on some real-life dataset
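The definition above translates directly into code. The following Python sketch computes containment similarity exactly and runs a brute-force threshold search; it illustrates the problem statement only and is not the GB-KMV sketch or LSH-E. The function names and the dictionary-of-sets layout for the record collection are assumptions made for illustration.

```python
def containment(q, x):
    """Containment similarity of q in x: |q ∩ x| / |q|."""
    q, x = set(q), set(x)
    if not q:
        raise ValueError("query record must be non-empty")
    return len(q & x) / len(q)


def containment_search(q, records, threshold):
    """Return the names of all records whose containment similarity
    with respect to q is at least the threshold (exact, brute force)."""
    return [name for name, x in records.items()
            if containment(q, x) >= threshold]


if __name__ == "__main__":
    query = {1, 2, 3, 4}
    records = {"a": {1, 2, 3, 4, 9}, "b": {1, 2}, "c": {7, 8}}
    print(containment_search(query, records, threshold=0.5))
```

Unlike Jaccard similarity, the measure is asymmetric: for the query `{1, 2, 3, 4}` and record `{1, 2}`, containment with respect to the query is 0.5, while containment with respect to the record would be 1.0.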

arXiv.org e-Print Archive

OPUS - University of Technology Sydney

Communication complexity of approximate maximum matching in the message-passing model

Authors: Huang Zengfeng, Radunovic Bozidar, Vojnovic Milan, Zhang Qin
Publication venue: Springer Science and Business Media LLC
Publication date: 01/12/2020

We consider the communication complexity of finding an approximate maximum matching in a graph in a multi-party message-passing communication model. The maximum matching problem is one of the most fundamental graph combinatorial problems, with a variety of applications. The input to the problem is a graph $G$ that has $n$ vertices and whose edges are partitioned over $k$ sites, together with an approximation ratio parameter $\alpha$. The output is required to be a matching in $G$, reported by one of the sites, whose size is at least a factor $\alpha$ of the size of a maximum matching in $G$. We show that the communication complexity of this problem is $\Omega(\alpha^2 kn)$ information bits. This bound is shown to be tight up to a $\log n$ factor, by constructing an algorithm, establishing its correctness, and proving an upper bound on its communication cost. The lower bound also applies to other graph combinatorial problems in the message-passing communication model, including max-flow and graph sparsification

LSE Research Online